ALIEN-59: K8s agent self-update — chart guardrails, /v1/rejoin recovery, persistence default#69
Conversation
…IEN-59)
Squashed replay of the K8s self-update feature onto current main:
- `alien-agent`: new `agent_upgrade` loop that consumes `agent_target.helm`
from /v1/sync and spawns a Helm-upgrade Job using the upgrader SA; sync
loop persists deployment-scoped token across restarts.
- `alien-helm`: chart guardrails — `required management.{url,name,token,
deploymentId}`, /livez + /readyz probes, Recreate strategy, upgrader SA
+ namespace-scoped Role, `upgrader.enabled` values block.
- `alien-core`: sync schema gains `agent_version`, `agent_os`, `agent_arch`,
`regime`, `agent_image_repository`; K8s worker transport set to `local`.
- `alien-manager`: PUT /v1/deployments/:id/target-agent-version; sync emits
`agent_target` when `target_agent_version` differs from the agent's
reported version; sqlite store + migrations + ReconcileData carry the
new inventory; sync-token persistence reuses across restarts.
- `alien-infra`: K8s ServiceAccount controller is skipped — the chart owns
it, controllers reference it by name.
- `alien-deployment`: Kubernetes platform routes through local env-info
collection (k8s-on-cloud picks the base cloud via existing context).
Copy the regenerated `apps/api/openapi.json` from the platform repo so the Rust SDK (`alien-platform-api`, generated via `progenitor` at build time) picks up the new `PUT /v1/deployments/:id/target-agent-version` endpoint, the extended SyncReconcileRequest fields, and the `basePlatform` property on `PersistImportedDeploymentRequest`.
Without a PVC backing `/var/lib/alien-agent`, the agent's `deployment_id` + deployment-scoped sync token live in an `emptyDir` and are wiped on any pod restart. The self-update flow (manager emits `agent_target.helm` → agent spawns helm-upgrade Job → Job rolls the agent Deployment) triggers exactly such a restart, and the new pod re-runs `/v1/initialize`, hits a DEPLOYMENT_NAME_ALREADY_EXISTS 409, CrashLoopBackOffs, helm `--atomic` times out, and rolls back. The feature is silently unusable. Default `runtime.data.persistence.enabled` to `true` so self-update survives the pod roll on any cluster with a default StorageClass. Operators on clusters without one can still set `storageClassName`, point at an `existingClaim`, or explicitly opt out and accept that self-update won't survive a pod roll. Helm snapshot tests rebaselined.
The agent's `data_dir` (`/var/lib/alien-agent`) holds its persistent
`deployment_id` + deployment-scoped sync token. When the chart leaves
`runtime.data.persistence` off (or anything else wipes that directory —
pod eviction, manual reset, etc.) the agent restarts with no stored
state and falls into the "first startup" branch, hits `/v1/initialize`,
and the platform rejects with `DEPLOYMENT_NAME_ALREADY_EXISTS 409`
because the row is still there. Result: CrashLoopBackOff, and if the
restart was driven by self-update's helm-upgrade Job, helm `--atomic`
times out and rolls back.
Add `POST /v1/rejoin` so the agent can explicitly re-attach to an
existing deployment by name. Same dg bearer the chart mounts, no body
beyond the name; response is `{ deploymentId, token }` with a freshly
minted deployment-scoped token. The agent code-path catches the 409
from initialize and falls through to rejoin instead of erroring.
- `alien-manager`: new `RejoinRequest`/`RejoinResponse`, `rejoin`
handler that mirrors the existing initialize idempotency branch
(look up by `(dg, name)`, mint a fresh token), wired through a new
`rejoin_router()` so multi-tenant embedders can override it the same
way they override initialize. `RouterOptions::include_rejoin` + a
`skip_rejoin()` builder method follow the existing pattern.
- `alien-agent`: on 409 from initialize, resolve the deployment name
the same way initialize did (CLI arg / env / hostname) and call the
new `rejoin_with_manager` helper. The recovered `(deployment_id,
token)` are written back to `data_dir` like a normal init, so
subsequent restarts go through the stored-state branch.
- `alien-manager-api` SDK regenerated; OSS `/v1/rejoin` is exposed in
the openapi schema for downstream tooling.
The agent's initialize → rejoin fall-through keyed off `http_status_code == Some(409)`, but `initialize_with_manager` wrapped every SDK error in `ConfigurationError` whose hardcoded `http_status_code = 500` clobbered the real status. Result: agent received "Unexpected response: 409 Conflict" from the manager, hit the `.context(ConfigurationError)` wrap, the outer status became 500, and the agent crashed instead of falling through to `/v1/rejoin`. - New `DeploymentNameAlreadyExists` variant on `ErrorData` (409, non-retryable, non-internal) so the 409 path has a distinct outer error. - `initialize_with_manager` rebuilt: 409 from the SDK is re-emitted as `DeploymentNameAlreadyExists`; everything else still goes through `ConfigurationError` so existing error paths are unchanged. Confirmed live: after wiping `/var/lib/alien-agent/*` and bouncing the pod, the agent now logs `Name already exists — assuming local state was wiped, rejoining…`, calls `/v1/rejoin`, recovers the same deployment_id, and stays running.
…-update-for-the-alien-agent-across-all-pull-mode-2 # Conflicts: # client-sdks/platform/openapi.json # client-sdks/platform/rust/openapi-3.0.json # client-sdks/platform/rust/openapi.json # crates/alien-helm/src/generator.rs # crates/alien-helm/tests/generator/boot_paths_tests.rs # crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__data_layer.snap # crates/alien-helm/tests/generator/snapshots/generator__generator__helpers__manager_only_pure_worker.snap
Greptile SummaryThis PR lands manager-driven K8s agent self-update (Helm-runner Job spawned from the agent on
Confidence Score: 3/5The self-update happy path is solid, but a failed helm-runner Job silently blocks all retry attempts for up to 10 minutes with no operator warning. The core upgrade loop, rejoin recovery, chart guardrails, and persistence defaults are well-reasoned and tested. The 409-on-failed-Job case in create_job quietly absorbs what should be a retryable error, and min_supported_version being pinned to the target version would block all upgrades once the agent enforces that field. The upgrade path is a new critical flow touching the agent self-replacement mechanism, so these gaps carry more weight than they would in a peripheral feature. crates/alien-agent/src/loops/agent_upgrade.rs (failed-Job 409 handling) and crates/alien-manager/src/routes/sync.rs (build_agent_target min_supported_version) need the most attention before this ships to production.
|
| Filename | Overview |
|---|---|
| crates/alien-agent/src/loops/agent_upgrade.rs | New Helm-runner Job spawner; 409 from a failed (but not yet TTL'd) Job is silently treated as idempotent success, stalling upgrades for up to 10 min with no operator warning. |
| crates/alien-manager/src/routes/sync.rs | New agent_sync inventory persistence and build_agent_target; min_supported_version is set equal to the target version, which would block all agents from upgrading once that field is enforced. |
| crates/alien-manager/src/routes/deployments.rs | New PUT /v1/deployments/:id/target-agent-version endpoint; auth, authz, and semver validation are present; is_plausible_semver rejects valid versions with both pre-release and build metadata. |
| crates/alien-helm/src/generator.rs | Chart guardrails (required fields, Recreate strategy, persistence default, upgrader SA/Role/ClusterRole), encryption key auto-generation, and /livez+/readyz probe defaults all look correct. |
| crates/alien-agent/src/cli.rs | Adds rejoin fall-through on 409 from initialize and sync-token persistence; logic and error routing look correct. |
| crates/alien-manager/src/stores/sqlite/deployment.rs | update_agent_metadata and set_target_agent_version are correctly scoped; neither updates the updated_at column, leaving deployment timestamps stale after inventory syncs. |
| crates/alien-core/src/sync.rs | SyncRequest/SyncResponse extended with new optional fields; AgentTarget, AgentHelmTarget, AgentBinaryTarget structs and back-compat tests look correct. |
| crates/alien-infra/src/core/executor.rs | Kubernetes ServiceAccounts are correctly skipped at controller-lookup time and auto-marked Running as platform-provided resources; logic mirrors external-binding treatment. |
Sequence Diagram
sequenceDiagram
participant Agent
participant Manager
participant K8sAPI
participant HelmRunnerJob
Note over Agent,Manager: Normal sync loop
Agent->>Manager: "POST /v1/sync (agent_version, regime=kubernetes)"
Manager->>Manager: update_agent_metadata()
Manager->>Manager: build_agent_target() if target differs from reported
Manager-->>Agent: SyncResponse with agent_target helm payload
Note over Agent,K8sAPI: Self-update actuator
Agent->>K8sAPI: POST /apis/batch/v1/.../jobs (helm-runner Job)
K8sAPI-->>Agent: 201 Created (or 409 if exists)
K8sAPI->>HelmRunnerJob: schedule pod
HelmRunnerJob->>HelmRunnerJob: helm upgrade --reuse-values --atomic --wait
Note over Agent,Manager: Pod rolls (Recreate strategy)
HelmRunnerJob->>K8sAPI: Deployment updated with new image tag
Agent->>Agent: /readyz returns 503 until first sync
Agent->>Manager: POST /v1/sync with new agent_version
Agent->>Agent: "first_sync_completed=true, /readyz returns 200"
Note over Agent,Manager: Rejoin after state wipe
Agent->>Manager: POST /v1/initialize returns 409
Agent->>Manager: POST /v1/rejoin with name
Manager->>Manager: get_deployment_by_name() and mint new token
Manager-->>Agent: deployment_id and fresh token
Agent->>Agent: db.set_sync_token() persists for next restart
Comments Outside Diff (1)
-
crates/alien-manager/src/stores/sqlite/deployment.rs, line 506-550 (link)updated_atis not bumped during agent metadata writesupdate_agent_metadata(and similarlyset_target_agent_version) never touches theupdated_atcolumn. Any API consumer that usesupdated_atto detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider settingupdated_at = datetime('now')inside the same UPDATE, or at least doing so forset_target_agent_versionwhich represents an explicit admin action that callers would expect to be reflected in the record's timestamp.Prompt To Fix With AI
This is a comment left during a code review. Path: crates/alien-manager/src/stores/sqlite/deployment.rs Line: 506-550 Comment: **`updated_at` is not bumped during agent metadata writes** `update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 4
crates/alien-agent/src/loops/agent_upgrade.rs:296-305
**Silently swallows a failed Job's 409 — upgrade stalls with no warning**
When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.
To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires.
### Issue 2 of 4
crates/alien-manager/src/routes/sync.rs:1107-1112
`min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade.
```suggestion
Some(alien_core::sync::AgentTarget {
version: target_version.to_string(),
// Allow all existing agents to receive the upgrade instruction.
// Set to a meaningful floor (e.g. the first version that supports
// self-update) once the agent enforces this field.
min_supported_version: "1.0.0".to_string(),
binary: None,
helm,
})
```
### Issue 3 of 4
crates/alien-manager/src/routes/deployments.rs:876-892
`splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.
```suggestion
fn is_plausible_semver(v: &str) -> bool {
// Strip optional +build metadata first so the pre-release check can
// use a simple split on '-'.
let without_build = v.splitn(2, '+').next().unwrap_or(v);
let mut parts = without_build.splitn(2, '-');
let core = parts.next().unwrap_or("");
let core_ok = {
let segs: Vec<&str> = core.split('.').collect();
segs.len() == 3
&& segs
.iter()
.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
};
let rest_ok = parts.next().map_or(true, |r| {
!r.is_empty()
&& r.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
});
core_ok && rest_ok
}
```
### Issue 4 of 4
crates/alien-manager/src/stores/sqlite/deployment.rs:506-550
**`updated_at` is not bumped during agent metadata writes**
`update_agent_metadata` (and similarly `set_target_agent_version`) never touches the `updated_at` column. Any API consumer that uses `updated_at` to detect "something changed on this deployment" will miss both inventory syncs and operator target-version pins. Consider setting `updated_at = datetime('now')` inside the same UPDATE, or at least doing so for `set_target_agent_version` which represents an explicit admin action that callers would expect to be reflected in the record's timestamp.
Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile
| } | ||
| out | ||
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; | ||
| use serde_json::json; | ||
|
|
||
| fn inputs() -> JobInputs { |
There was a problem hiding this comment.
Silently swallows a failed Job's 409 — upgrade stalls with no warning
When the Helm-runner Job exists but is in a Failed state (exhausted its backoffLimit), the Kubernetes API still returns 409 on a new create. The code logs at info level and returns Ok(()), so apply_agent_target sees success. The manager keeps emitting agent_target on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to ttlSecondsAfterFinished seconds (10 minutes). During that window, there is no warn log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.
To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a warn when the existing Job is in a Failed condition, so operators can investigate before the TTL window expires.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-agent/src/loops/agent_upgrade.rs
Line: 296-305
Comment:
**Silently swallows a failed Job's 409 — upgrade stalls with no warning**
When the Helm-runner Job exists but is in a `Failed` state (exhausted its `backoffLimit`), the Kubernetes API still returns 409 on a new create. The code logs at `info` level and returns `Ok(())`, so `apply_agent_target` sees success. The manager keeps emitting `agent_target` on every sync tick, but the agent will keep hitting the same 409 and silently doing nothing for up to `ttlSecondsAfterFinished` seconds (10 minutes). During that window, there is no `warn` log and no error propagated — an operator watching logs will only see "Upgrader Job already exists (409); leaving in place" with no indication the actual upgrade never happened.
To distinguish "another sync already queued it" (desired idempotency) from "it exists but is already dead," the handler should query the Job status and emit at least a `warn` when the existing Job is in a `Failed` condition, so operators can investigate before the TTL window expires.
How can I resolve this? If you propose a fix, please make it concise.| Some(alien_core::sync::AgentTarget { | ||
| version: target_version.to_string(), | ||
| min_supported_version: target_version.to_string(), | ||
| binary: None, | ||
| helm, | ||
| }) |
There was a problem hiding this comment.
min_supported_version is set equal to version, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the reported_version == Some(target_version) guard returns None first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like "1.0.0", or by explicitly recording the minimum agent version that is safe to self-upgrade.
| Some(alien_core::sync::AgentTarget { | |
| version: target_version.to_string(), | |
| min_supported_version: target_version.to_string(), | |
| binary: None, | |
| helm, | |
| }) | |
| Some(alien_core::sync::AgentTarget { | |
| version: target_version.to_string(), | |
| // Allow all existing agents to receive the upgrade instruction. | |
| // Set to a meaningful floor (e.g. the first version that supports | |
| // self-update) once the agent enforces this field. | |
| min_supported_version: "1.0.0".to_string(), | |
| binary: None, | |
| helm, | |
| }) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-manager/src/routes/sync.rs
Line: 1107-1112
Comment:
`min_supported_version` is set equal to `version`, which will break upgrades once the field is enforced. The field is described as "the agent should refuse the upgrade if its own version is older than this." Setting it to the target version means only an agent already at version X would accept the upgrade to X — but those agents never receive the instruction (the `reported_version == Some(target_version)` guard returns `None` first). Every agent that actually needs upgrading would refuse. The MVP intent (allow all current agents to upgrade) is better expressed as an empty floor like `"1.0.0"`, or by explicitly recording the minimum agent version that is safe to self-upgrade.
```suggestion
Some(alien_core::sync::AgentTarget {
version: target_version.to_string(),
// Allow all existing agents to receive the upgrade instruction.
// Set to a meaningful floor (e.g. the first version that supports
// self-update) once the agent enforces this field.
min_supported_version: "1.0.0".to_string(),
binary: None,
helm,
})
```
How can I resolve this? If you propose a fix, please make it concise.| fn is_plausible_semver(v: &str) -> bool { | ||
| let mut parts = v.splitn(2, |c| c == '-' || c == '+'); | ||
| let core = parts.next().unwrap_or(""); | ||
| let core_ok = { | ||
| let segs: Vec<&str> = core.split('.').collect(); | ||
| segs.len() == 3 | ||
| && segs | ||
| .iter() | ||
| .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())) | ||
| }; | ||
| let rest_ok = parts.next().map_or(true, |r| { | ||
| !r.is_empty() | ||
| && r.chars() | ||
| .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-') | ||
| }); | ||
| core_ok && rest_ok | ||
| } |
There was a problem hiding this comment.
splitn(2, |c| c == '-' || c == '+') splits on the first occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. 1.4.0-rc.1+build.123) fail: the remainder becomes "rc.1+build.123" and the + isn't in the allowed character set [alphanumeric, '.', '-']. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.
| fn is_plausible_semver(v: &str) -> bool { | |
| let mut parts = v.splitn(2, |c| c == '-' || c == '+'); | |
| let core = parts.next().unwrap_or(""); | |
| let core_ok = { | |
| let segs: Vec<&str> = core.split('.').collect(); | |
| segs.len() == 3 | |
| && segs | |
| .iter() | |
| .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())) | |
| }; | |
| let rest_ok = parts.next().map_or(true, |r| { | |
| !r.is_empty() | |
| && r.chars() | |
| .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-') | |
| }); | |
| core_ok && rest_ok | |
| } | |
| fn is_plausible_semver(v: &str) -> bool { | |
| // Strip optional +build metadata first so the pre-release check can | |
| // use a simple split on '-'. | |
| let without_build = v.splitn(2, '+').next().unwrap_or(v); | |
| let mut parts = without_build.splitn(2, '-'); | |
| let core = parts.next().unwrap_or(""); | |
| let core_ok = { | |
| let segs: Vec<&str> = core.split('.').collect(); | |
| segs.len() == 3 | |
| && segs | |
| .iter() | |
| .all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())) | |
| }; | |
| let rest_ok = parts.next().map_or(true, |r| { | |
| !r.is_empty() | |
| && r.chars() | |
| .all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-') | |
| }); | |
| core_ok && rest_ok | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/alien-manager/src/routes/deployments.rs
Line: 876-892
Comment:
`splitn(2, |c| c == '-' || c == '+')` splits on the **first** occurrence of either separator, so versions with both a pre-release suffix and build metadata (e.g. `1.4.0-rc.1+build.123`) fail: the remainder becomes `"rc.1+build.123"` and the `+` isn't in the allowed character set `[alphanumeric, '.', '-']`. The function correctly rejects garbage input, but it silently blocks valid SemVer 2.0.0 strings that operators may reasonably try to pin.
```suggestion
fn is_plausible_semver(v: &str) -> bool {
// Strip optional +build metadata first so the pre-release check can
// use a simple split on '-'.
let without_build = v.splitn(2, '+').next().unwrap_or(v);
let mut parts = without_build.splitn(2, '-');
let core = parts.next().unwrap_or("");
let core_ok = {
let segs: Vec<&str> = core.split('.').collect();
segs.len() == 3
&& segs
.iter()
.all(|s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()))
};
let rest_ok = parts.next().map_or(true, |r| {
!r.is_empty()
&& r.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '.' || c == '-')
});
core_ok && rest_ok
}
```
How can I resolve this? If you propose a fix, please make it concise.| /// POST a Job to the in-pod Kubernetes API. Returns Ok on 2xx, AlienError | ||
| /// otherwise. 409 (already exists) is treated as success — another sync | ||
| /// already kicked the Job and idempotency handles the retry. | ||
| async fn create_job(namespace: &str, body: &serde_json::Value) -> Result<()> { |
There was a problem hiding this comment.
these APIs should be in alien-k8s-clients
| pub agent_version: Option<String>, | ||
| /// `linux` / `macos` / `windows`. From `std::env::consts::OS`. | ||
| #[serde(default, skip_serializing_if = "Option::is_none")] | ||
| pub agent_os: Option<String>, |
There was a problem hiding this comment.
I think we have an existing enum for this somewhere in alien-core
| pub agent_os: Option<String>, | ||
| /// `x86_64` / `aarch64`. From `std::env::consts::ARCH`. | ||
| #[serde(default, skip_serializing_if = "Option::is_none")] | ||
| pub agent_arch: Option<String>, |
There was a problem hiding this comment.
I think we have an existing enum for this somewhere in alien-core
| /// How the agent is supervised — `os-service` (launcher) or `kubernetes` | ||
| /// (Helm). Detected at runtime from `KUBERNETES_SERVICE_HOST`. | ||
| #[serde(default, skip_serializing_if = "Option::is_none")] | ||
| pub regime: Option<AgentRegime>, |
There was a problem hiding this comment.
regime is a really weird name haha
| exec helm upgrade \"$RELEASE\" \"$CHART_REF\"{version_flag}{extra_args_flag} \ | ||
| --namespace \"$NAMESPACE\" \ | ||
| --reuse-values \ | ||
| --atomic --wait \ |
There was a problem hiding this comment.
Can you document why did we choose --atomic?
| --values /tmp/values.json" | ||
| ); | ||
|
|
||
| let body = serde_json::json!({ |
There was a problem hiding this comment.
What happens if creating this job fails? Or pulling the job's image fails? Or the command inside it fails? Do we report back those failures in any way?
How many retries do we support (I think the answer should be unlimited)? Is there exponential backoff or constant interval between retries?
| "name": job_name, | ||
| "namespace": inputs.namespace, | ||
| "labels": { | ||
| "app.kubernetes.io/managed-by": "alien-agent", |
There was a problem hiding this comment.
We recently improved our white-labeling support, need to make sure it fits there. See other places in alien-helm
| @@ -0,0 +1,453 @@ | |||
| //! Agent self-update actuator. When `/v1/sync` returns | |||
There was a problem hiding this comment.
Right now this will only work for Kubernetes deployments, right? No support for local service swap etc? What'll happen in local?
Summary
Closes ALIEN-59. Manager-driven agent self-update on Kubernetes, plus the two production-hardening fixes that came out of live e2e.
Feature
agent_upgradeloop inalien-agentconsumesagent_target.helmon/v1/syncand spawns a Helm-runner Job that runshelm upgrade --reuse-values --set runtime.image.tag=<target>under analien-agent-upgraderServiceAccount.alien-helmchart guardrails —required management.{url,name,token,deploymentId},/livez+/readyzprobes, Recreate strategy, namespace-scoped upgrader Role/Binding,upgrader.enabledopt-out.agent_version/agent_os/agent_arch/regime/agent_image_repository; K8s worker transport set tolocal; K8s ServiceAccount controller is skipped (chart owns it).alien-managerexposesPUT /v1/deployments/:id/target-agent-version;build_agent_targetemitsagent_target.helmwhen target ≠ reported and regime = kubernetes.Production hardening (would silently break otherwise; both found via live e2e)
fix(helm): enable agent runtime-data persistence by default— without a PVC, the self-update's pod roll wipes the agent'sdeployment_id/sync token, the new pod hits 409 from initialize, CrashLoopBackOffs, and helm--atomicrolls back. Documented opt-out.feat(agent,manager): /v1/rejoin endpoint— for the case where persistence is genuinely off (or the state directory is wiped for any other reason), the agent's initialize-409 falls through to a newPOST /v1/rejointhat re-acquires a deployment-scoped sync token over the existing row. NewRouterOptions::include_rejoin+skip_rejoin()builder lets multi-tenant embedders override, same pattern as/v1/initialize.fix(agent): preserve 409 from initialize so rejoin fall-through fires— the agent's.context(ConfigurationError)was clobbering the SDK's real 409 with a hardcoded 500. NewDeploymentNameAlreadyExistsvariant (409) lets the conflict surface so the rejoin branch actually triggers.Paired platform-side PR: https://github.com/alienplatform/platform/pull/new/yonatan/alien-59-self-update-for-the-alien-agent-across-all-pull-mode-2
Test plan
cargo test -p alien-agent -p alien-core -p alien-deployment -p alien-helm -p alien-manager --libagent_target→ helm-upgrade Job rolls the agent pod → PVC keepsdeployment_id→ new agent reports the new versionkubectl exec rm -rf /var/lib/alien-agent/*+ bounce pod → agent logsName already exists — assuming local state was wiped, rejoining…→ recovers the originaldeployment_idand stays runningclient-sdks/platform/rust/openapi*.jsonkept ours during the main merge to preserve the rejoin/target-agent-version schemas; main's addedManagerDomainBinding*+PackageDomainschemas in those files were dropped (they were generated from a private platform branch). Next platform-side openapi regen will bring them back.